Skip to content

fix(source-s3): cap concurrent file readers at 20 to prevent OOM#78416

Draft
devin-ai-integration[bot] wants to merge 2 commits into
masterfrom
devin/1779724200-source-s3-fix-oom-concurrency
Draft

fix(source-s3): cap concurrent file readers at 20 to prevent OOM#78416
devin-ai-integration[bot] wants to merge 2 commits into
masterfrom
devin/1779724200-source-s3-fix-oom-concurrency

Conversation

@devin-ai-integration

Copy link
Copy Markdown
Contributor

What

Caps the source-s3 concurrent file reader count at 20 to prevent OOM failures on large S3 streams within the 2 Gi container memory limit.

Resolves https://github.com/airbytehq/oncall/issues/12714:

The concurrent file-based cursor migration (#78325) enabled 100 concurrent file readers by default for source-s3. On large S3 streams, the buffered records from 100 workers exceed the container's 2 Gi memory limit, causing MemoryMonitor to trip at 95% usage or kernel OOM kills (exit 137).

How

Set _concurrency_level = 20 on SourceS3, down from the CDK default of 100. This attribute is now respected by FileBasedSource.__init__ (see companion CDK PR: airbytehq/airbyte-python-cdk#1035) when creating the ConcurrentSource thread pool.

Also removed the unused DEFAULT_CONCURRENCY import that was previously needed.

Review guide

  1. source_s3/v4/source.py_concurrency_level = 20 replaces DEFAULT_CONCURRENCY
  2. unit_tests/v4/test_source.py — new test verifying concurrency level is set to 20
  3. metadata.yaml / pyproject.toml — version bump 4.15.4 → 4.15.5
  4. docs/integrations/sources/s3.md — changelog entry

User Impact

Source-s3 syncs on large S3 streams will no longer OOM. Throughput is reduced from 100x to 20x concurrent readers (vs legacy single-threaded), which still provides significant speedup while staying under the 2 Gi container limit.

Can this PR be safely reverted and rolled back?

  • YES 💚

Link to Devin session: https://app.devin.ai/sessions/ff881275041b469f9a6aed60a6af0fe2

The concurrent file-based cursor migration (PR #78325) enabled 100
concurrent file readers by default. On large S3 streams this causes
the source to exceed the 2 Gi container memory limit.

Reduce _concurrency_level from 100 to 20, which still provides
significant throughput improvement over the legacy single-threaded
path while keeping peak RSS well under the container limit.

Resolves airbytehq/oncall#12714

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

Copy link
Copy Markdown
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • 🛠️ Quick Fixes
    • /format-fix - Fixes most formatting issues.
    • /bump-version - Bumps connector versions, scraping changelog description from the PR title.
      • Bump types: patch (default), minor, major, major_rc, rc, promote.
      • The rc type is a smart default: applies minor_rc if stable, or bumps the RC number if already RC.
      • The promote type strips the RC suffix to finalize a release.
      • Example: /bump-version type=rc or /bump-version type=minor
    • /bump-progressive-rollout-version - Alias for /bump-version type=rc. Bumps with an RC suffix and enables progressive rollout.
  • ❇️ AI Testing and Review (internal link: AI-SDLC Docs):
    • /ai-prove-fix - Runs prerelease readiness checks, including testing against customer connections.
    • /ai-canary-prerelease - Rolls out prerelease to 5-10 connections for canary testing.
    • /ai-review - AI-powered PR review for connector safety and quality gates.
  • 📝 AI Documentation:
    • /ai-docs-review - AI-powered documentation review for PRs with connector changes.
    • /ai-create-docs-pr - Creates a documentation PR for connector changes, stacked on the current PR.
  • 🚀 Connector Releases:
    • /publish-connectors-prerelease - Publishes pre-release connector builds (tagged as {version}-preview.{git-sha}) for all modified connectors in the PR.
  • ☕️ JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
  • 🐍 Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.
  • ⚙️ Admin commands:
    • /force-merge reason="<REASON>" - Force merges the PR using admin privileges, bypassing CI checks. Requires a reason.
      Example: /force-merge reason="CI is flaky, tests pass locally"
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@devin-ai-integration devin-ai-integration Bot added the hyd-fix Hydra: ai-fix stage has run label May 25, 2026
Co-Authored-By: bot_apk <apk@cognition.ai>
@@ -356,6 +356,7 @@ This connector utilizes the open source [Unstructured](https://unstructured-io.g

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[markdownlint-fix] reported by reviewdog 🐶

Suggested change

@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Deploy preview for airbyte-docs ready!

Project:airbyte-docs
Status: ✅  Deploy successful!
Preview URL:https://airbyte-docs-at2bt4l3l-airbyte-growth.vercel.app
Latest Commit:ec6e32b

Deployed with vercel-action

@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

source-s3 Connector Test Results

219 tests   196 ✅  7m 25s ⏱️
  3 suites   23 💤
  3 files      0 ❌

Results for commit ec6e32b.

♻️ This comment has been updated with latest results.

@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

↪️ Triggering /ai-review per Hands-Free AI Triage Project triage next step.

Reason: Draft source-s3 fix PR has passing connector checks and prove-fix evidence; AI review is the next Hydra stage before human merge consideration.

Devin session

@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

↪️ Intended to trigger /ai-review per Hands-Free AI Triage Project triage next step, but workflow dispatch is currently failing for AI command workflows.

Reason: Draft source-s3 fix PR has passing connector checks and prove-fix evidence; AI review is the next Hydra stage before human merge consideration.

Operational note: GitHub workflow_dispatch returned HTTP 500/403 with the configured trigger token; repository_dispatch does not start this workflow because the workflow does not subscribe to repository_dispatch. This has been reported as an Ops MCP/workflow blocker.

Devin session

@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

↪️ Triggered /ai-prove-fix per Hands-Free AI Triage Project PR pipeline sweep.

Reason: Draft source-s3 fix PR is linked to oncall issue https://github.com/airbytehq/oncall/issues/12714 and has no visible prove-fix run; prove-fix is the next Hydra validation stage.


Devin session

@octavia-bot

octavia-bot Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

🔍 AI Prove Fix session starting... Running readiness checks and testing against customer connections. View playbook

Devin AI session created successfully!

@devin-ai-integration

devin-ai-integration Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Fix Validation Evidence

Outcome: In Progress

Evidence Summary

Preparing validation for source-s3 PR #78416. Pre-flight checks passed, hyd-prove is applied, and pre-release publish is running for airbyte/source-s3:4.15.5-preview.ec6e32b. Required regression test attempt 1 is queued: https://github.com/airbytehq/airbyte-ops-mcp/actions/runs/26509877251

Evidence Plan

Proving Criteria

  • The target build uses a 20-worker source-s3 concurrent reader cap instead of the prior 100-worker file-based CDK pool.
  • A source-s3 regression test against real S3 credentials completes without unexpected SPEC, CHECK, DISCOVER, or READ regressions.
  • If regression testing exercises a large/memory-sensitive S3 read, no Source memory usage exceeded critical threshold or exit 137 OOM occurs.

Disproving Criteria

  • Regression tests show unexpected differences unrelated to the intended concurrency cap.
  • The same memory-monitor/OOM failure persists on a connection that exercises the affected large-file code path.
  • New source-s3 failures appear during CHECK, DISCOVER, or READ after applying the target build.

Testing Strategy Decision

I will start with required source connector regression tests in comparison mode using GSM integration-test credentials. This is a source connector and is not a forced OAuth write-back connector. Because the reported issue is memory pressure on a large customer stream, GSM regression tests may only prove no broad regression, not the exact OOM fix. If the regression report is clean but does not exercise the large-file/OOM path, live connection testing would require explicit human approval before pinning any customer/internal connection.

Ranked Cases

  1. GSM source-s3 regression test credentials, all streams — required first attempt; proves broad source-s3 behavior did not regress.
  2. Private oncall-reported affected connection — strongest proving case for the memory issue, but the source actor is already pinned to a previous preview version, so it is not eligible for default live pinning without explicit human approval.
  3. Other recent unpinned failed source-s3 connections from production query — backup candidates only after log qualification confirms the same memory failure pattern and after required approval for live pinning.
  4. Internal healthy source-s3 connection — none found in the last 2 days via the initial internal query; would only prove lack of regression.
Pre-flight Checks
  • Viability: the PR sets SourceS3._concurrency_level = 20; companion CDK PR 1035 makes FileBasedSource.__init__ use that value to size ConcurrentSource workers.
  • Safety: no suspicious code, obfuscation, credential handling, external calls, or data-exfiltration patterns in the changed files.
  • Design intent: the prior 100-worker behavior came from the CDK default/concurrent migration; the PR intentionally reduces source-s3 concurrency to prevent 2 Gi OOMs.
  • Breaking Change: no schema/spec/state/stream/PK/cursor changes detected; no breaking metadata signals.
  • Pin Exclusion: direct affected source actor is already pinned, so it is excluded from default live testing unless a human explicitly approves overwrite risk.
  • Reversibility: patch version bump only; rollback to the previous connector version does not require state or config migration.
Current Detailed Log
  • Issue context: private oncall issue reports source memory monitor threshold failure during sync.
  • PR diff: changed metadata.yaml, pyproject.toml, source_s3/v4/source.py, unit_tests/v4/test_source.py, and docs changelog only.
  • Labeling: added hyd-prove to the public PR and private oncall issue.
  • Pre-release: triggered publish workflow https://github.com/airbytehq/airbyte/actions/runs/26509599730 for airbyte/source-s3:4.15.5-preview.ec6e32b.
  • Regression test attempt 1: queued comparison-mode GSM run at https://github.com/airbytehq/airbyte-ops-mcp/actions/runs/26509877251.
  • Local static/smoke evidence: the new source-s3 unit assertion passed locally when temporarily using the companion CDK branch from airbytehq/airbyte-python-cdk PR 1035. The temporary local dependency pin was reverted and not committed.

Devin session

@devin-ai-integration devin-ai-integration Bot added the hyd-prove Hydra: ai-prove-fix stage has run label May 27, 2026
@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Pre-release Connector Publish Started

Publishing pre-release build for connector source-s3.
PR: #78416

Pre-release versions will be tagged as {version}-preview.ec6e32b
and are available for version pinning via the scoped_configuration API.

View workflow run
⚠️ Pre-release Publish CANCELLED for source-s3.

@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

Fix Validation Evidence

Outcome: Regression evidence clean; live OOM-path proof still needs human-approved pin/sync testing.

Evidence Summary

Pre-flight checks passed for #78416: viable, safe, non-breaking, design-aligned, and reversible. The source-s3 prerelease image airbyte/source-s3:4.15.5-preview.ec6e32b was built and pushed, and required source connector regression testing passed in comparison mode: https://github.com/airbytehq/airbyte-ops-mcp/actions/runs/26509877251

Regression results:

Command Result
SPEC Passed: control 4.15.4 and target dev both exit 0
CHECK Passed: control 4.15.4 and target dev both exit 0
DISCOVER Passed: control 4.15.4 and target dev both exit 0
READ Passed: control 4.15.4 and target dev both exit 0; identical output counts

READ emitted identical results for control and target: 10 records on stream test, 1 state, 4 traces, and 12 logs. Artifact scan found no Source memory usage exceeded critical threshold, exit 137, or traceback in the regression outputs.

Interpretation: This proves no broad SPEC/CHECK/DISCOVER/READ regression for the source-s3 GSM test path. It does not fully prove the original large-stream OOM fix because the regression READ used a small test stream and did not exercise the reported high-memory customer workload.

Evidence Plan and Detailed Results

Proving Criteria

  • The target build uses a 20-worker source-s3 concurrent reader cap instead of the prior 100-worker file-based CDK pool.
  • A source-s3 regression test against real S3 credentials completes without unexpected SPEC, CHECK, DISCOVER, or READ regressions.
  • If a large/memory-sensitive live read is approved and executed, no Source memory usage exceeded critical threshold or exit 137 OOM occurs.

Disproving Criteria

  • Regression tests show unexpected differences unrelated to the intended concurrency cap.
  • The same memory-monitor/OOM failure persists on a connection that exercises the affected large-file code path.
  • New source-s3 failures appear during CHECK, DISCOVER, or READ after applying the target build.

Pre-release Publish Notes

Workflow: https://github.com/airbytehq/airbyte/actions/runs/26509599730

The connector image was built and pushed:

  • docker.io/airbyte/source-s3:4.15.5-preview.ec6e32b
  • Multi-arch manifest digest: sha256:0868e4147d842c8cfcb8fc9dddd13380ceec7a85ef1555a5823c673024dfc517

Versioned connector metadata artifacts were also published for 4.15.5-preview.ec6e32b. The workflow later marked the overall run cancelled because the downstream registry compile job was cancelled, so I am not treating the overall publish workflow as fully successful; the image and versioned metadata needed for pinning are present.

Regression Test Details

Workflow: https://github.com/airbytehq/airbyte-ops-mcp/actions/runs/26509877251

Mode:

  • Connector: source-s3
  • Catalog mode: GSM-generated catalog
  • Comparison mode: skip_compare=false
  • Control: 4.15.4
  • Target: dev

Results:

  • SPEC: both versions succeeded; message counts identical: 1 LOG, 1 SPEC.
  • CHECK: both versions succeeded; message counts identical: 1 CONNECTION_STATUS, 3 LOG.
  • DISCOVER: both versions succeeded; message counts identical: 1 CATALOG, 4 LOG.
  • READ: both versions succeeded; message counts identical: 12 LOG, 10 RECORD, 1 STATE, 4 TRACE; stream test emitted 10 records for both versions.

Live Testing Status

The original customer-specific details and candidate connection IDs are documented only on the private oncall issue: https://github.com/airbytehq/oncall/issues/12714#issuecomment-4554429470

I requested human approval for one live TIER_2 pin/sync test using this prerelease. Without that approval, I did not pin or run customer connections.

Recommendation

Treat the PR as having clean regression evidence but not direct OOM reproduction proof yet. For higher confidence before merge or broader rollout, use /ai-canary-prerelease or approve one scoped live pin/sync test through the human-in-the-loop request.


Devin session

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connectors/source/s3 hyd-fix Hydra: ai-fix stage has run hyd-prove Hydra: ai-prove-fix stage has run

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant